Introduction

We chose this dataset for further analysis because it raises several questions that we hope to answer:

  1. Do the predicting factors chosen initially really affect life expectancy?

  2. What are the predicting variables actually affecting the life expectancy?

  3. Should a country with a lower life expectancy (<65 years) increase its healthcare expenditure in order to improve its average lifespan? How do infant and adult mortality rates affect life expectancy?

  4. Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?

  5. What is the impact of schooling on the lifespan of humans?

  6. Does life expectancy have a positive or negative relationship with alcohol consumption?

  7. Do densely populated countries tend to have lower life expectancy?

  8. What is the impact of Immunization coverage on life Expectancy?

  9. Does the sample give enough evidence to say that developed countries have a higher average life expectancy than developing countries?

  10. Do the countries that spend a higher proportion of their resources on human development have a higher life expectancy?

  11. What is the most frequent range of life expectancy?

1. Obtaining Data

For this project, we obtained the Life Expectancy dataset from Kaggle (link). The health factors data was collected from the WHO data repository website, and the corresponding economic data was obtained from the United Nations website with the assistance of Deeksha Russell and Duan Wang. The dataset covers 193 countries over the years 2000-2015 and consists of 2938 observations and 22 attributes, of which 20 are meant to be predicting variables. These predicting variables fall into several broad categories, including immunization-related factors, mortality factors, economic factors, and social factors.

20 real-valued features:

It’s worth mentioning that the data frame contains some missing values for attributes such as Hepatitis B, Alcohol, GDP, and others. Additionally, some countries, such as Vanuatu, Tonga, Togo, Cabo Verde, etc., have been excluded from the dataset because they had too many missing values, which would negatively impact the result.

2. Importing data set and cleaning data

2.1 Missing values

2.1.1 Data Preprocessing

In this step, we examine the presence of missing values in our dataset and perform necessary preprocessing. Approximately 43.87% of the dataset contains missing values, which is nearly half of the data. It is important to analyze how missing values are distributed across different attributes.

Upon analyzing the data, we found that certain attributes have a significant number of missing values. The highest number of missing values is observed in the attributes of Population, GDP, Hepatitis B, followed by Total Expenditure, Alcohol, Income Composition of Resources, and Schooling.

The attribute with the most missing values is Population. Dealing with missing data presents several challenges, and one approach is to remove the missing entries. However, in our case, this would result in discarding a substantial portion of our dataset, which could adversely affect the accuracy of future predictions. Another option is to replace missing values with the mean or median of the Population variable. However, due to the wide range of values in this feature, such an approach would introduce inaccuracies.

After careful consideration, we decided to conduct further research and obtained actual values for most of the missing data from The World Bank website (link). The next step involves importing a new dataset and replacing the missing values with the data retrieved from The World Bank site.

2.1.2 Handling the Population Attribute

We successfully obtained actual values for a substantial portion of the missing data, and as a result, the number of missing values in the Population attribute has been reduced to 50 entries. Consequently, we can now proceed to remove these remaining missing values from our dataset.

2.1.3 Handling Missing Values in Other Attributes

After successfully addressing the missing values in the Population and GDP attributes, we still have other attributes that contain missing values. For these attributes, we were unable to find replacement values in external sources. Hence, we need to employ different techniques to handle them, such as imputation with the mean, imputation with the median, or combined imputation, depending on the nature of the data.

To handle the missing data, we will utilize mean or median imputation. Mean imputation is suitable for attributes that follow a normal or approximately symmetric distribution without significant outliers. On the other hand, median imputation is more appropriate for attributes with skewed distributions and significant outliers. To assess the distribution of each attribute, we will plot histograms for visualization.
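As an illustration, the choice between the two strategies can be sketched in a few lines of Python (the attribute values below are made up for illustration and are not taken from the dataset):

```python
import math
import statistics

def impute(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]

# Roughly symmetric attribute -> mean imputation is reasonable
schooling = [10.1, 12.3, None, 11.8, 12.0]
# Skewed attribute with an extreme value -> median imputation is safer
gdp = [584.3, 612.7, 98432.9, None, 731.2]

print(impute(schooling, strategy="mean"))
print(impute(gdp, strategy="median"))
```

Note how the single extreme GDP value would drag a mean-based fill far above the typical country, which is exactly why the median is preferred for skewed attributes.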

Based on the previous histograms, we can assume that attributes such as Life Expectancy, Total Expenditure, Income Composition of Resources, and Schooling exhibit a bell-shaped normal curve. Therefore, these attributes are potential candidates for mean imputation. However, before applying mean imputation, we also need to examine the presence of outliers in each attribute to ensure that our imputation process is not influenced by extreme values.

To identify outliers in each attribute, we will utilize boxplots. Upon analyzing the previous boxplots, we observe that out of the four candidates identified earlier, only Schooling and Income Composition of Resources are more suitable for mean imputation. For the remaining 15 attributes, we have decided to proceed with median imputation.

Furthermore, we check the number of NA values in the dataset again to gain a better understanding of the extent of missing data.

2.2 Outliers

In this stage, we have confirmed that our dataset does not contain any missing values (NA-values). Hence, we can proceed to the next step of processing outliers. There are several methods available for outlier detection, including visual techniques like boxplots and histograms, as well as statistical methods such as Tukey’s Method. Once outliers have been identified using these methods, it is important to preprocess them accordingly. Several techniques can be employed for outlier preprocessing, including:

  1. Dropping outliers: One approach is to remove the outliers from the dataset entirely, excluding them from subsequent analyses. This can be appropriate when outliers are deemed as data errors or extreme values that do not align with the overall pattern of the data.

  2. Limiting/Winsorizing outliers: Instead of eliminating outliers, this technique involves capping or replacing outlier values with predefined limits. By setting a threshold, the extreme values are brought within an acceptable range while retaining their relative position in the distribution. Winsorizing is a common variation of this approach.

  3. Transforming the data: Another strategy is to apply mathematical transformations to the data, such as taking logarithms, inverses, square roots, or other suitable transformations. These transformations can help normalize the distribution and mitigate the impact of outliers on subsequent analyses.

Our plan is to describe and apply these three methods. For each method, we will evaluate the results of the model. The final decision on the best method will be based on the performance of the models. The first step is to plot boxplots for each attribute in our dataset, which will help us identify potential outliers.

Metric                            Number of outliers   Percent of data that is outlier
Life.expectancy                    19                   0.67
Adult.Mortality                    88                   3.09
infant.deaths                     329                  11.56
Alcohol                             2                   0.07
percentage.expenditure            369                  12.97
Measles                           521                  18.31
under.five.deaths                   0                   0.00
Hepatitis.B                       313                  11.00
HIV.AIDS                          546                  19.19
BMI                                 0                   0.00
Polio                             259                   9.10
Total.expenditure                  42                   1.48
Diphtheria                        282                   9.91
GDP                               491                  17.26
Population                        405                  14.24
thinness..1.19.years              109                   3.83
thinness.5.9.years                106                   3.73
Income.composition.of.resources   116                   4.08
Schooling                          62                   2.18

Upon examining the attributes, we observed that BMI is the only attribute that does not exhibit any outliers. However, attributes such as Alcohol, Life Expectancy, and Income Composition of Resources contain a relatively small number of outliers. Consequently, completely dropping these outliers might not be the most appropriate solution for this particular problem.

To address this, we will begin by utilizing the z-score method for identifying and removing outliers.

  1. Z-score method: The z-score method calculates the number of standard deviations an observation lies from the mean. By applying a threshold, we can identify data points that deviate significantly from the expected values. These identified outliers can then be further processed using the previously mentioned outlier preprocessing techniques.

Implementing the z-score method will allow us to effectively handle the outliers in the dataset, ensuring that they do not unduly influence the subsequent analyses.
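A minimal Python sketch of z-score filtering (the data vector and the threshold of 2 are illustrative only):

```python
import statistics

def remove_zscore_outliers(values, threshold=3.0):
    """Keep only observations within `threshold` standard deviations of the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mu) / sd) <= threshold]

data = [70, 71, 69, 72, 68, 70, 71, 150]  # 150 is an implausible extreme value
print(remove_zscore_outliers(data, threshold=2.0))
```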

  2. Tukey’s fences: Tukey’s method, also known as Tukey’s fences, defines upper and lower bounds based on the interquartile range (IQR). Data points that fall beyond these bounds are considered potential outliers. This method provides a robust approach to identifying outliers, as it is less sensitive to extreme values.

    Comparing these two models, we can see that the second model, which uses the original dataset, has a higher adjusted R-squared value (0.8168) than the first model (0.7602). This indicates that the second model explains a greater proportion of the variance in the dependent variable (life expectancy) based on the independent variables.

    Additionally, the second model has a higher residual standard error (4.042) than the first model (2.704), suggesting that its predictions show somewhat more variability around the regression line. Nevertheless, comparing the IQR (Tukey) method with the z-score method, the latter performs better overall.
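The fences themselves are simple to compute. A Python sketch with an illustrative data vector (quartiles come from the standard library's default exclusive method, which may differ slightly from R's):

```python
import statistics

def tukey_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [65, 66, 67, 68, 69, 70, 71, 72, 110]
low, high = tukey_fences(data)
outliers = [v for v in data if v < low or v > high]
print(low, high, outliers)
```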

  3. Imputation + outliers

  4. Winsorization: Winsorization is another outlier preprocessing technique that involves replacing extreme outlier values with less extreme values within a predefined range. This approach helps to mitigate the impact of outliers while still retaining the relative position of the data points in the distribution.
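A minimal Python sketch of percentile-based winsorization (the 10th/90th capping percentiles and the data vector are illustrative only):

```python
import math

def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Cap values that fall below/above the given empirical percentiles."""
    s = sorted(values)
    n = len(s)
    lo = s[math.ceil(lower_pct * (n - 1))]   # value at the lower percentile
    hi = s[math.floor(upper_pct * (n - 1))]  # value at the upper percentile
    return [min(max(v, lo), hi) for v in values]

data = [1, 52, 53, 54, 55, 56, 57, 58, 59, 300]
print(winsorize(data, 0.10, 0.90))
```

The two extreme values are pulled to the nearest retained observation, while everything in between is untouched.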

Let’s examine the Adjusted R-squared values, as they account for the number of predictors in each model:

Model             Adjusted R-squared
data_winsorize    0.8722
data_z            0.8638
data_Tukey        0.7602
data_impute_out   0.8159

Based on the Adjusted R-squared values, we can observe that the data_winsorize and data_z models have the highest values among the models compared, with Adjusted R-squared values of 0.8722 and 0.8638, respectively. These higher values indicate that these models explain a larger proportion of the variance in the dependent variable than the other models.

However, in addition to the Adjusted R-squared values, we also need to consider the distribution of the life expectancy variable after applying the model. In our research, the assumption of a normal distribution is crucial. It is important to note that winsorization, the technique used in the data_winsorize model, may disturb the normality of the life expectancy distribution.

Taking this into consideration, we will choose the data_z model as it also provides a high Adjusted R-squared value (0.8638) while preserving the normality assumption of the life expectancy distribution.

2.3 Data Transformation

2.3.1 Factorizing categorical variables:

Before proceeding with the data exploration phase, it is important to perform necessary preprocessing steps to ensure accurate and meaningful analysis. One such step is the factorization of the Status variable.

Since our analysis focuses on predicting Life Expectancy, it is crucial to ensure that this attribute follows a normal distribution.

2.3.2 Transforming the Life Expectancy variable:

In order to improve the distribution and address any potential skewness, we applied a square root transformation to the Life Expectancy values. This transformation helps in normalizing the data and reducing the impact of extreme values.

## Skewness: 0.0665222231245973
## kurtosis : 2.941852

The skewness value of 0.067 is close to zero, indicating that the distribution is approximately symmetric. A skewness value near zero suggests that the data is relatively normally distributed, or very close to it, in terms of symmetry.

The kurtosis value of 2.94 is slightly less than 3. This suggests that the distribution has slightly lighter tails and a slightly flatter peak compared to a normal distribution. However, a kurtosis value of 2.94 is still close to 3, indicating that the distribution is not significantly different from a normal distribution in terms of its tail behavior.

Overall, based on the skewness and kurtosis values provided, the data appears to have a reasonably symmetric distribution and is relatively close to a normal distribution in terms of both skewness and kurtosis.
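The effect of the square-root transformation can be checked with a small sketch (Python; the sample values are hypothetical, and the skewness formula uses a simple moment estimator rather than the exact estimator an R package would report):

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [45, 50, 55, 60, 62, 64, 66, 68, 70, 89]  # right-skewed toy sample
transformed = [math.sqrt(v) for v in raw]       # square-root transform
print(round(skewness(raw), 3), round(skewness(transformed), 3))
```

The transformed values come out less skewed than the raw ones, which is precisely the behavior the transformation is meant to achieve.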

2.3.3 Scaling the numeric variables

Another important aspect of data preprocessing is scaling the variables. It enables us to compare and analyze variables with different scales and units without any dominance based on their magnitudes. Scaling is particularly beneficial when working with algorithms that are sensitive to variable scales, such as regression models or distance-based algorithms.

By scaling the variables, we enhance interpretability and facilitate comparison. Scaling also aids in visualizing the data and identifying patterns or relationships between variables more effectively and ensuring consistency in scale across all variables.
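Standardization (z-scaling) itself is straightforward; a Python sketch with made-up GDP values:

```python
import statistics

def standardize(values):
    """Scale to zero mean and unit standard deviation (z-scores)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

gdp = [584.3, 612.7, 731.2, 982.1, 1520.6]
scaled = standardize(gdp)
# After scaling, the mean is 0 and the standard deviation is 1
print(round(statistics.mean(scaled), 6), round(statistics.stdev(scaled), 6))
```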

By factorizing the Status variable, transforming the Life.expectancy variable, and finally scaling the numeric variables, we have prepared the dataset for further exploration and analysis, setting the stage for uncovering meaningful insights and relationships within the data.

3. Exploration of the data

Here we want to use both univariate and bivariate analysis methods. Our goals:

  1. Exploring the relationship between continuous variables and the target variable (life expectancy) as well as their interrelationships.

  2. Investigating the impact of categorical variables on the target variable (life expectancy).

  3. Examining the relationship between the variables “Country Status” and “Year” with continuous variables. Note that due to the dataset containing a large number of countries with small sample sizes, making country-to-country comparisons may not provide significant insights.

3.1 Numerical variables

3.1.1 Univariate Analysis

Univariate analysis looks at the data for each variable on its own. This is generally best done using histograms for continuous data, count/bar plots for categorical data, and, of course, descriptive statistics obtained via summary().

As you can see, life expectancy, total expenditure, Income.composition.of.resources, and schooling appear to be approximately normally distributed.

Let’s check normality with qq-plots

3.1.2 Bivariate Analysis

  1. Continuous variables compared to the life expectancy (target variable) and to one another
  2. Categorical variables compared to the life expectancy (target variable)
  3. Comparison of Country Status and Year to Continuous variables (country has an extremely large number of values with small sample sizes, so country comparisons aren’t especially helpful for this dataset)

This scatter plot shows that ‘Schooling’, ‘Income composition of resources’, and ‘BMI’ have a strong positive correlation with Life Expectancy. On the other hand, ‘Adult Mortality’ and ‘HIV/AIDS’ have a negative correlation with Life Expectancy.

In our analysis, we used the correlation matrix to explore the relationships among the scaled variables. The matrix was visualized as a heatmap, where darker or lighter shades indicated stronger correlations. This allowed us to identify clusters or groups of variables that were highly correlated. By calculating the correlation coefficients between pairs of variables, we gain insights into the strength and direction of their linear associations.

After scaling the variables, we examined the correlations between them and identified several notably correlated pairs:

  • Schooling and Income Composition of Resources
  • Thinness 5-9 years and Thinness 1-19 years
  • Under-five deaths and Infant deaths
  • Polio and Diphtheria

While these variables exhibit some correlation, it does not necessarily indicate collinearity among independent variables. Collinearity refers to a high degree of correlation between independent variables, which can pose challenges in statistical analysis, particularly in linear models.

To assess the potential collinearity, we recommend calculating the Variance Inflation Factor (VIF) for the variables in a linear model. The VIF helps identify if collinearity is present by quantifying the inflation in the variances of the regression coefficients. A VIF value exceeding 5 suggests a problem with collinearity, indicating that one or more variables are highly correlated with each other.

If high VIF values are observed, it is advisable to address the collinearity issue by eliminating one of the correlated variable pairs. This step helps mitigate the impact of collinearity and improves the stability and interpretability of the regression model.

It’s important to note that pairwise correlations between variables do not by themselves establish collinearity. Therefore, conducting further analysis, such as calculating the VIF, is crucial to identify and address any potential collinearity issues in the dataset.
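For intuition, in the special case of a model with only two predictors the VIF reduces to 1 / (1 - r^2). A Python sketch with two hypothetical, strongly related vectors (stand-ins for a pair such as infant deaths and under-five deaths):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical, strongly related pair of predictors
infant = [2, 4, 6, 8, 10, 12]
under_five = [3, 5, 8, 11, 13, 16]
r = pearson_r(infant, under_five)
vif = 1 / (1 - r ** 2)  # with two predictors, VIF = 1 / (1 - r^2)
print(round(r, 4), round(vif, 1))
```

A VIF this far above 5 is the kind of signal that leads to dropping one variable of the pair.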

After observing the correlation matrix, it becomes evident that three pairs of variables display high Variance Inflation Factors (VIFs). In order to address this issue of collinearity, we will proceed by omitting one variable from each correlated pair based on their VIF values.

Specifically, the variables “infant.deaths” and “under.five.deaths” exhibit VIFs that significantly exceed the threshold of 5, indicating strong collinearity. To resolve this, we will remove the variable “under.five.deaths”, since it possesses the higher VIF value.

Similarly, we will eliminate the variable “Income.composition.of.resources” due to its higher VIF value.

Furthermore, we will exclude the variable “thinness.5.9.years” as it demonstrates a higher VIF value.

The updated version of the data, after removing these three features, will be saved as “data_EDA.” This modified dataset will be utilized as a dataframe for further analysis in the “Model_EDA” section.

The correlation matrix now shows no suspicious coefficients that might indicate collinearity between the features. Upon closer examination, it becomes evident that the VIF values fall within an acceptable range, all being below the threshold of 5. With this observation, we can confidently state that data_EDA is now prepared and suitable for further analysis in the Model_EDA phase.

3.1.3 Continuous to Life Expectancy comparison (ANOVA)

To check if the continuous variables influence Life Expectancy we apply ANOVA Test to each variable.

For each variable we will categorize countries into one of three categories: ‘Low’, ‘Medium’, and ‘High’, depending on the country’s average for that feature.

First we group the data by country and find the average life expectancy over the 16 years and we compute the average for the feature we want to test.

We obtain a new dataframe with the average life expectancy and the level of the tested feature (low, medium, or high) as columns, each row corresponding to one of the 193 countries in the dataset.

We then apply the ANOVA test, where the null hypothesis is H0: mu_low = mu_medium = mu_high and the alternative hypothesis is that not all the means are equal.
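The grouping-and-test procedure can be sketched in Python with a toy example (three hypothetical groups of countries; the F statistic is the ratio of the between-group and within-group mean squares, matching the F value column of the R output below):

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group MS / within-group MS."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    df_b = len(groups) - 1
    df_w = len(all_vals) - len(groups)
    return (ss_between / df_b) / (ss_within / df_w)

# Hypothetical average life expectancies for three feature levels
low = [55, 58, 60, 57]
medium = [65, 67, 66, 68]
high = [75, 78, 76, 77]
F = one_way_anova_F([low, medium, high])
print(round(F, 1))
```

A large F means the variation between the Low/Medium/High group means dwarfs the variation within each group, which is what drives the tiny p-values reported below.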

##                  Df Sum Sq Mean Sq F value Pr(>F)    
## Adult.Mortality   2   9726    4863     168 <2e-16 ***
## Residuals       184   5326      29                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Alcohol       2   3338  1669.0   26.22 9.59e-11 ***
## Residuals   184  11713    63.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                         Df Sum Sq Mean Sq F value Pr(>F)    
## Percentage_Expenditure   2   6155  3077.3   63.65 <2e-16 ***
## Residuals              184   8897    48.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## Hepatitis_B   2   1033   516.6   6.781 0.00144 **
## Residuals   184  14018    76.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Measles       2   2436  1217.8   17.76 8.85e-08 ***
## Residuals   184  12615    68.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value Pr(>F)    
## BMI           2   7029    3514   80.61 <2e-16 ***
## Residuals   184   8022      44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                    Df Sum Sq Mean Sq F value  Pr(>F)   
## Total_Expenditure   2   1024   511.8   6.713 0.00153 **
## Residuals         184  14028    76.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value Pr(>F)    
## HIV           2   9467    4733     156 <2e-16 ***
## Residuals   184   5584      30                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value Pr(>F)    
## GDP           2   5962  2981.2   60.36 <2e-16 ***
## Residuals   184   9089    49.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              Df Sum Sq Mean Sq F value Pr(>F)
## Population    2    186   93.23   1.154  0.318
## Residuals   184  14865   80.79
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Thinness_1.19   2   7238    3619   85.22 <2e-16 ***
## Residuals     184   7813      42                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Thinness_5.9   2   5482    2741   52.71 <2e-16 ***
## Residuals    184   9569      52                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living. Using the sample, check whether countries that spend a higher proportion of their resources on human development have a higher life expectancy.

We will be using the ANOVA test to test the significance of the Human Development Index (HDI) on life expectancy. Here we will categorize countries into one of three categories: ‘Low’ (≤0.5), ‘Medium’ (>0.5 and ≤0.7), and ‘High’ (>0.7), depending on the country’s average HDI (Income.composition.of.resources) value.

Firstly, we will group the data by country and find the average life expectancy and Income.composition.of.resources for each country over the 16 years.

##                                 Df Sum Sq Mean Sq F value Pr(>F)    
## Income_Decomposition_Resources   2   8389    4195   115.9 <2e-16 ***
## Residuals                      184   6662      36                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the p-value is <0.05, so countries with a higher income composition of resources for human development have significantly better life expectancy. Thus countries should spend more on human development to achieve higher life expectancy.

  • Education creates awareness about healthy living. For example, vaccine hesitancy during the Covid-19 period, especially among the rural population, has highlighted the importance of education. Using the sample, test whether average schooling years have a significant impact on life expectancy.

Which test to use?

We will be using the ANOVA test to test the significance of education on life expectancy. Here we will categorize countries into one of the three categories: ‘Low’ (≤8), ‘Medium’(>8 and ≤12), ‘High’ (>12) depending upon the country’s average schooling years.

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Education     2  74.73   37.37   76.04 <2e-16 ***
## Residuals   184  90.41    0.49                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, all the tests return a p-value lower than 0.05, except the one for Population. This means that all the variables except Population have an effect on Life Expectancy.

  • In Covid-19 times we all have seen the importance of immunization against the virus to increase life expectancy. Looking at the data, can you show that immunization against Polio and Diphtheria has a significant effect on life expectancy?

We will use a two-way ANOVA test. Here we will divide the countries into two categories for both Polio and Diphtheria. Countries having values of % immunization coverage for one-year-old greater than the median value will get category ‘High’ else ‘Low’.

Step 1: Countries with Polio (mean) coverage for one-year-olds ≤85 will get the label ‘Low’, else ‘High’.

Step 2: Countries with Diphtheria (mean) coverage for one-year-olds ≤85 will get the label ‘Low’, else ‘High’.

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Polio         1   4346    4346   87.34  < 2e-16 ***
## Diphtheria    1   1548    1548   31.10 8.64e-08 ***
## Residuals   184   9157      50                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values for both Polio and Diphtheria immunization coverage for one-year-olds are less than 0.05; hence we can say that immunization has a significant impact on life expectancy.

  • We want to compare the proportions of infant deaths and under-five deaths. Since we have observed a high correlation in the correlation matrix, we would like to determine if there is a significant difference between these two variables. Can we conclude that there is a statistically significant difference between the proportions of infant deaths and under-five deaths?

We will conduct a two-proportions z-test to compare the two independent proportions. Firstly, we will group the data by country and then find the average life expectancy, infant deaths, and under-five deaths for each country.

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  arg1 out of arg2
## X-squared = 1.767, df = 1, p-value = 0.1838
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.027211995  0.005211995
## sample estimates:
## prop 1 prop 2 
##  0.030  0.041

Since the p-value is greater than 0.05, we see no significant difference between the two independent proportions.
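For reference, the two-proportion z statistic reported above can be sketched as follows (Python; the group sizes of 1000 are an assumption chosen so the counts match the reported proportions of 0.030 and 0.041, and with that assumption the squared z statistic roughly reproduces the reported X-squared of 1.767):

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sample z-test for equality of proportions, no continuity correction."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 30 vs 41 deaths out of 1000 (assumed group sizes)
z, p = two_prop_ztest(30, 1000, 41, 1000)
print(round(z * z, 3), round(p, 4))
```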

  • Does life expectancy have a positive or negative correlation with habits like drinking alcohol? Does the result point out any strong conclusion?

To assess the relationship between alcohol consumption and adult mortality rate, we can employ two approaches. First, we can create a scatter plot to visualize the data points and discern any potential correlation between the variables. Second, we can conduct a Pearson correlation test to quantitatively measure the correlation strength between alcohol consumption and adult mortality rate. By employing these methods, we can gain a deeper understanding of the association between these variables.

## 
##  Pearson's product-moment correlation
## 
## data:  data3$Average_Adult_Mortality and data3$Average_Alcohol
## t = -3.8229, df = 185, p-value = 0.00018
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3986006 -0.1322244
## sample estimates:
##        cor 
## -0.2705838
## 
##  Pearson's product-moment correlation
## 
## data:  data2$Average_Life and data3$Average_Alcohol
## t = 6.6054, df = 185, p-value = 4.086e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3129765 0.5461110
## sample estimates:
##       cor 
## 0.4368508

The correlations between alcohol consumption and health indicators yield mixed results. While there is a positive correlation with life expectancy, there is a negative correlation with adult mortality rate. These correlations, however, are not strong enough to draw remarkable conclusions. Further research is needed to better understand the complex relationship between alcohol consumption and these health outcomes.
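The Pearson test statistic reported above can be reproduced in outline (Python; the per-country averages here are hypothetical stand-ins, and t follows r with n - 2 degrees of freedom):

```python
import math

def pearson_with_t(x, y):
    """Pearson r and its t statistic with n - 2 degrees of freedom."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return r, t

# Hypothetical per-country averages
alcohol = [1.2, 3.5, 5.1, 7.8, 9.4, 11.0]
life = [62.0, 66.5, 69.0, 71.2, 74.8, 76.1]
r, t = pearson_with_t(alcohol, life)
print(round(r, 3), round(t, 2))
```

With the real data, n = 187 countries gives the df = 185 seen in the R output.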

3.2 Categorical Variables to Life Expectancy Comparison

To answer the first question, we need to see whether any of the variables have an effect on life expectancy. We look at the categorical features first, starting with ‘Status’.

First we divide the dataset into developing countries and developed countries and for each country we compute the mean of the Life Expectancy values obtained through the years.

First we want to check if the variance of the developed countries is the same as the variance of the developing countries. For this we use an F-test.

## 
##  F test to compare two variances
## 
## data:  developed$Average_Life and developing$Average_Life
## F = 0.13941, num df = 31, denom df = 146, p-value = 2.241e-08
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.08406763 0.25570670
## sample estimates:
## ratio of variances 
##          0.1394094

As we can see, the p-value is lower than 0.05, so we reject the null hypothesis and accept the alternative hypothesis that the variances of the two populations are different.
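The F statistic itself is just the ratio of the two sample variances; a Python sketch with hypothetical group averages (much smaller samples than the real 32 and 147 countries):

```python
import statistics

def f_ratio(x, y):
    """F statistic for comparing two sample variances: var(x) / var(y)."""
    return statistics.variance(x) / statistics.variance(y)

# Hypothetical average life expectancies for the two status groups
developed = [78.2, 79.5, 80.1, 79.0, 78.8]
developing = [65.3, 68.9, 62.1, 70.4, 66.0, 64.5]
print(round(f_ratio(developed, developing), 3))
```

A ratio well below 1, as in the R output above, indicates that the developed group is much less variable than the developing group.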

Now we want to see if the developed countries have a higher average life expectancy than Developing countries. For this we use a two sample t-test.

## 
##  Welch Two Sample t-test
## 
## data:  developed$Average_Life and developing$Average_Life
## t = 13.165, df = 134.02, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  10.49488      Inf
## sample estimates:
## mean of x mean of y 
##  79.19785  67.19263

As we can see, the p-value is smaller than 0.05, so we reject the null hypothesis and accept the alternative hypothesis that developed countries have a higher average life expectancy than developing countries.
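A sketch of the Welch statistic and its Satterthwaite degrees of freedom (Python; the group values are hypothetical and far smaller than the real samples, so the numbers will not match the R output):

```python
import math
import statistics

def welch_t(x, y):
    """Welch's t statistic and Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2 = vx / nx + vy / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Hypothetical average life expectancies for the two status groups
developed = [78.2, 79.5, 80.1, 79.0, 78.8]
developing = [65.3, 68.9, 62.1, 70.4, 66.0, 64.5]
t, df = welch_t(developed, developing)
print(round(t, 2), round(df, 1))
```

Welch's variant is the right choice here precisely because the F-test above showed the two groups have unequal variances.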

Based on the result of the above t-test, there appears to be a very significant difference between ‘Developing’ and ‘Developed’ countries with respect to their Life Expectancy. Since this is the case, a comparison between the status variable and all other continuous variables should be made before moving to the feature engineering phase.

–Status Variable Compared to other Continuous Variables–

Since the status variable only contains two different values, it is likely best to compare a number of descriptive statistics for those two values with respect to all the other continuous variables.

Variable                          p-value
Life.expectancy                   9.013849e-314
Adult.Mortality                   1.249538e-164
infant.deaths                     7.347845e-37
Alcohol                           9.458846e-192
percentage.expenditure            9.887803e-38
Measles                           1.717022e-16
under.five.deaths                 1.596958e-38
Hepatitis.B                       9.284856e-18
HIV.AIDS                          3.950474e-61
BMI                               6.883171e-65
Polio                             8.299911e-75
Total.expenditure                 3.593352e-36
Diphtheria                        1.851033e-62
GDP                               5.671048e-06
Population                        0.2783308
thinness..1.19.years              2.116436e-301
thinness.5.9.years                2.732105e-296
Income.composition.of.resources   1.006483e-303
Schooling                         7.433698e-206

Based on the results, there are significant differences between developed and developing countries for every variable above except Population, whose p-value (0.278) exceeds 0.05; all other p-values fall far below that threshold.

–Life Expectancy over the Years–

We aim to investigate the trend of life expectancy over the years.

While there appears to be a positive correlation between life expectancy and the passage of years, it is essential to determine whether the differences observed between each year are statistically significant. Are these differences substantial enough to consider them meaningful variations in life expectancy?

Time Period P-value
2000 to 2001 0.7413284
2001 to 2002 0.8458235
2002 to 2003 0.949058
2003 to 2004 0.7579069
2004 to 2005 0.6189956
2005 to 2006 0.6110444
2006 to 2007 0.7044814
2007 to 2008 0.813566
2008 to 2009 0.6288716
2009 to 2010 0.8541843
2010 to 2011 0.5278882
2011 to 2012 0.7830066
2012 to 2013 0.7871843
2013 to 2014 0.7159781
2014 to 2015 0.9554911

Based on the results of the conducted t-tests, the p-values obtained for all comparisons between consecutive years are greater than 0.05, indicating that there is no significant evidence to support the presence of substantial differences in Life Expectancy between these years.

4. Model selection and comparison

To apply linear regression we need to make sure that four conditions are satisfied:

  1. No multicollinearity: no high correlation between the independent variables;
  2. Linearity: there must be a linear relationship between the target variable (Life Expectancy) and the predictors;
  3. Normality: the residuals must be normally distributed;
  4. Homoscedasticity: the residuals must have a constant variance.

The first condition is already satisfied, as we previously removed the variables ‘infant.deaths’, ‘under.five.deaths’, ‘GDP’ and ‘thinness..1.19.years’, which had the highest VIF values and therefore the strongest collinearity.

We start by building the linear model

## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + Status + Alcohol + 
##     percentage.expenditure + Hepatitis.B + Measles + BMI + Polio + 
##     Total.expenditure + Diphtheria + HIV.AIDS + Population + 
##     thinness..1.19.years + Income.composition.of.resources + 
##     Schooling, data = data_EDA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4078 -0.2363  0.0225  0.2719  1.8165 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.234292   0.026605  -8.806  < 2e-16 ***
## Adult.Mortality                  0.252827   0.011330  22.314  < 2e-16 ***
## StatusDeveloped                  0.285710   0.030692   9.309  < 2e-16 ***
## Alcohol                          0.003227   0.011558   0.279 0.780106    
## percentage.expenditure          -0.093635   0.010004  -9.359  < 2e-16 ***
## Hepatitis.B                      0.028586   0.009885   2.892 0.003859 ** 
## Measles                          0.032791   0.009162   3.579 0.000351 ***
## BMI                             -0.067149   0.011234  -5.977 2.55e-09 ***
## Polio                           -0.065743   0.011906  -5.522 3.66e-08 ***
## Total.expenditure               -0.023772   0.009391  -2.531 0.011413 *  
## Diphtheria                      -0.091262   0.012498  -7.302 3.67e-13 ***
## HIV.AIDS                         0.200055   0.010270  19.480  < 2e-16 ***
## Population                      -0.020474   0.009241  -2.215 0.026807 *  
## thinness..1.19.years             0.056147   0.011323   4.959 7.52e-07 ***
## Income.composition.of.resources -0.150204   0.014989 -10.021  < 2e-16 ***
## Schooling                       -0.242062   0.015793 -15.327  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4599 on 2829 degrees of freedom
## Multiple R-squared:  0.7896, Adjusted R-squared:  0.7885 
## F-statistic: 707.8 on 15 and 2829 DF,  p-value: < 2.2e-16

We should preface this by saying that we don’t have to prove the four assumptions are “perfectly met”, but we need to see to which extent they are violated and see if we can get results that can be considered satisfying.

We start by checking the second condition and we do it by producing the Residuals vs Fitted plot.

For linearity to hold, the points should be evenly distributed on both sides of the line, and the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.

In our case, there is no pattern in the residual plot. This means that we can assume a linear relationship between the predictors and the outcome variable.

Let’s now check the third condition.

Looking at the Q-Q plot above, we see that most of the points lie on the diagonal line, except at the extremes, which deviate from it; therefore, the Q-Q plot is inconclusive regarding the normality of the residuals. We need another way to check whether the normality condition is met, so let’s try plotting the histogram of the residuals.

The histogram shows that most of the residuals fall around zero and that the number of observations in the tails (the extremes) is low. We can conclude that the residuals of our regression model follow an approximately normal distribution.

Let’s now check the last condition. To check this homoscedasticity assumption we can use the Breusch-Pagan test.

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 188.95, df = 15, p-value < 2.2e-16

As we can see, p < 0.05, so there is evidence that the homoscedasticity assumption is not fulfilled. When homoscedasticity fails, the model’s estimate of the mean remains good, but its confidence intervals do not.

We can also see to which extent the Homoscedasticity is violated by looking at the Scale-Location plot.

This plot shows whether the residuals are spread equally along the range of fitted values; ideally, the line should be close to horizontal with equally spread points, which is roughly our case. This means that homoscedasticity is violated, but not to a large extent. Finally, we check the residuals for autocorrelation with the Durbin-Watson test.

##  lag Autocorrelation D-W Statistic p-value
##    1       0.6455986     0.7066183       0
##  Alternative hypothesis: rho != 0

From the output above, the test statistic is 0.7066183 and the corresponding p-value is 0. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the residuals are autocorrelated. Moreover, the D-W statistic of approximately 0.7 is close to 0, indicating strong positive autocorrelation.
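The Durbin-Watson statistic can also be computed in Python with statsmodels. This sketch simulates AR(1) residuals with rho ≈ 0.65 (roughly what the R output estimates) to show how a DW value near 0.7 arises:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
eps = rng.normal(size=1000)
resid = np.empty(1000)
resid[0] = eps[0]
for t in range(1, 1000):
    # AR(1) residuals with rho = 0.65, mimicking the autocorrelation above
    resid[t] = 0.65 * resid[t - 1] + eps[t]

dw = durbin_watson(resid)
print(dw)  # DW is roughly 2*(1 - rho); values near 0 mean positive autocorrelation
```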

In the next section, we will focus on feature selection, where we aim to identify the optimal set of variables for our analysis.

4.1 Feature Selection

Feature selection is a crucial step in model construction, as it involves identifying a subset of relevant features from a larger set. In this report, we explore variable selection on the life expectancy dataset using forward, backward, and mixed (stepwise) selection. Additionally, we use four metrics, namely the residual sum of squares (RSS), adjusted R^2, Mallows’ Cp, and the Bayesian information criterion (BIC), to determine the most relevant subset of variables. By examining the differences in the number of features selected under these criteria, we can determine which criterion yields the more parsimonious model. We aim to identify the criterion that best aligns with our goal of selecting a subset of features that captures the essential information while minimizing redundancy. Ultimately, this enables us to construct a robust and interpretable model for predicting life expectancy.

We will use 3 regression subset methods to come up with the most relevant subset of features namely:

  • Forward Selection: We begin with an empty model and iteratively add the most significant feature based on the chosen criterion (BIC, Cp, or adjusted R^2) until a stopping condition is met.

  • Backward Selection: We start with all features and iteratively remove the least significant feature based on the selected criterion until a stopping condition is met.

  • Mixed Selection: This method combines forward and backward selection, iteratively adding and removing features based on the chosen criterion until a stopping condition is satisfied.

The different metrics used to evaluate the best subsets of variables provided varying recommendations:

  • RSS: 16 variables

  • adjr2: 16 variables

  • Cp: 15 variables

  • BIC: 13 variables.

Based on these metrics, we have decided to prioritize the BIC as the criterion for selecting the best subset. This is because the BIC tends to penalize models with a larger number of variables more heavily. As a result, the BIC generally favors smaller models compared to Cp and AIC. In this analysis, we specifically exclude the AIC criterion since Cp and AIC yield equivalent results in terms of selecting the same model. Therefore, by considering the BIC metric, we can achieve a balance between model complexity and goodness of fit, resulting in the selection of a subset with a minimum number of variables, as anticipated.

We generate visualizations comparing the performance of different subsets on the R-squared, adjusted R-squared, Cp, and BIC metrics. Each plot shows the number of predictors on the x-axis and the respective metric on the y-axis. A companion figure shows metric values for different combinations of predictor variables: each row contains black boxes (the variable is included in the model) and white boxes (the variable is excluded), and represents a separate multivariate model with its own metric value.

Conclusion: Through the application of regression subset methods, we successfully perform variable selection and evaluate the model performance using multiple metrics. The plots assist in understanding the impact of different subsets on model fit, while the coefficient analysis sheds light on the significance of predictors in explaining life expectancy. These findings contribute to the development of a robust and interpretable model for predicting life expectancy based on the available dataset.

## Number of Optimal Features by forward selection: 13
## Number of Optimal Features by backward selection: 13
## Number of Optimal Features by mixed techniques: 13

The feature selection analysis, incorporating forward, backward, and mixed techniques, consistently identifies 13 features as the optimal subset for building a predictive model. These selected features showcase their importance in accurately predicting the target variable, enhancing the model’s performance and interpretability. By focusing on these 13 features, we can create a streamlined and efficient model that avoids unnecessary complexity and reduces the risk of overfitting. Additionally, we examine the selected variables and their coefficients for each method to gain further insight into the model’s behavior.

Overall, when the forward, backward, and mixed methods yield the same coefficients for specific variables, it strengthens the evidence for the importance and reliability of those variables in predicting the target variable. It provides consistency, stability, and confidence in the selected features, which are crucial for building effective and interpretable regression models.

4.2 Model data

Initially, we apply a simple linear model to the dataset using selected features. We then compare the results obtained from this model with those of a linear model trained on the entire set of features. Subsequently, we explore the use of ridge regression and lasso regression models, incorporating shrinkage parameters, to observe any differences in the outcomes.
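The modelling pipeline just described (plain linear model versus shrinkage models) can be sketched in Python with scikit-learn; the report itself uses R, and the data here are synthetic, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression problem standing in for the standardized dataset
X, y = make_regression(n_samples=500, n_features=13, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, mdl in [("OLS", LinearRegression()),
                  ("Ridge", Ridge(alpha=1.0)),   # L2 shrinkage
                  ("Lasso", Lasso(alpha=0.1))]:  # L1 shrinkage
    mdl.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, mdl.predict(X_te))
    print(name, round(mse, 3))
```

In practice the shrinkage strength `alpha` would be chosen by cross-validation (e.g. `RidgeCV`/`LassoCV`) rather than fixed as it is here.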

4.2.1 Simple linear model on selected features

## RMSE:  0.4600969
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3795 -0.2376  0.0182  0.2688  1.7982 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.266538   0.029828  -8.936  < 2e-16 ***
## Adult.Mortality                  0.256353   0.013825  18.542  < 2e-16 ***
## Hepatitis.B                      0.032883   0.011899   2.764 0.005771 ** 
## BMI                             -0.070112   0.013442  -5.216 2.02e-07 ***
## Diphtheria                      -0.085438   0.014970  -5.707 1.32e-08 ***
## HIV.AIDS                         0.195127   0.012229  15.957  < 2e-16 ***
## Income.composition.of.resources -0.137560   0.017278  -7.961 2.84e-15 ***
## Measles                          0.040019   0.011324   3.534 0.000419 ***
## percentage.expenditure          -0.081826   0.011594  -7.058 2.33e-12 ***
## Polio                           -0.065741   0.013873  -4.739 2.30e-06 ***
## Schooling                       -0.248588   0.018170 -13.681  < 2e-16 ***
## StatusDeveloped                  0.326859   0.034164   9.567  < 2e-16 ***
## thinness..1.19.years             0.049813   0.013004   3.831 0.000132 ***
## Total.expenditure               -0.008764   0.011094  -0.790 0.429643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4612 on 1981 degrees of freedom
## Multiple R-squared:  0.7877, Adjusted R-squared:  0.7864 
## F-statistic: 565.5 on 13 and 1981 DF,  p-value: < 2.2e-16

4.2.2 Simple linear model on whole features

## RMSE 0.4601041
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33948 -0.23328  0.02395  0.27052  1.77515 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.2404638  0.0320903  -7.493 1.01e-13 ***
## Adult.Mortality                  0.2554388  0.0138661  18.422  < 2e-16 ***
## infant.deaths                    0.0022790  0.0140311   0.162 0.870988    
## Alcohol                         -0.0139939  0.0138083  -1.013 0.310974    
## percentage.expenditure          -0.0859182  0.0116764  -7.358 2.72e-13 ***
## Hepatitis.B                      0.0312245  0.0120290   2.596 0.009508 ** 
## Measles                          0.0420960  0.0126921   3.317 0.000927 ***
## BMI                             -0.0677261  0.0134526  -5.034 5.23e-07 ***
## Polio                           -0.0663533  0.0138561  -4.789 1.80e-06 ***
## Total.expenditure                0.0006514  0.0114387   0.057 0.954590    
## Diphtheria                      -0.0831358  0.0149885  -5.547 3.30e-08 ***
## HIV.AIDS                         0.1954656  0.0122778  15.920  < 2e-16 ***
## GDP                             -0.0338749  0.0128747  -2.631 0.008576 ** 
## Population                      -0.0095069  0.0143559  -0.662 0.507901    
## thinness..1.19.years             0.0486770  0.0144623   3.366 0.000778 ***
## Income.composition.of.resources -0.1385853  0.0172701  -8.025 1.73e-15 ***
## Schooling                       -0.2496408  0.0185953 -13.425  < 2e-16 ***
## StatusDeveloped                  0.2944134  0.0371639   7.922 3.87e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4603 on 1977 degrees of freedom
## Multiple R-squared:  0.789,  Adjusted R-squared:  0.7872 
## F-statistic: 434.9 on 17 and 1977 DF,  p-value: < 2.2e-16

Simple Linear Model Results

Model R-squared Adjusted R-squared RMSE
Reduced Model 0.7877 0.7864 0.4600969
Full Model 0.789 0.7872 0.4601041

4.2.3 Ridge Regression Model

## 
##  F test to compare two variances
## 
## data:  (y_test - predictions) and (predictionsEDA - y_testEDA)
## F = 1.0265, num df = 849, denom df = 838, p-value = 0.7041
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8968019 1.1749280
## sample estimates:
## ratio of variances 
##           1.026519
Comparison of Variances between Reduced Model and Full Model

The F-test was conducted to compare the variances of the reduced and full models’ prediction errors. The test yielded an F-statistic of 1.026519, with a p-value of 0.7041. The results indicate no significant difference in variances between the two models; any difference observed is likely due to random variation.
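R's `var.test` has no direct scipy equivalent, but the same two-sided F-test for equal variances is easy to compute by hand. This sketch uses synthetic stand-ins for the two models' prediction errors, with sample sizes matching the degrees of freedom above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# synthetic stand-ins for the two models' prediction errors
res_reduced = rng.normal(size=850)
res_full = rng.normal(size=839)

F = np.var(res_reduced, ddof=1) / np.var(res_full, ddof=1)
df1, df2 = len(res_reduced) - 1, len(res_full) - 1
# two-sided p-value for H0: the variances are equal
p = 2 * min(stats.f.cdf(F, df1, df2), stats.f.sf(F, df1, df2))
print(F, p)  # a large p-value gives no evidence the variances differ
```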

Ridge Regression Model Results

Model Mean Squared Error R-squared Adjusted R-squared
Reduced Model 0.214421 0.7877608 0.7844604
Full Model 0.2100862 0.7958114 0.7915834

4.2.4 Lasso Regression Model

## 
##  F test to compare two variances
## 
## data:  (y_test - predictions) and (predictionsEDA - y_testEDA)
## F = 1.0331, num df = 849, denom df = 838, p-value = 0.6365
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9025581 1.1824694
## sample estimates:
## ratio of variances 
##           1.033108
Comparison of Variances between Reduced Model and Full Model

The F-test was conducted to compare the variances of the reduced and full models’ prediction errors. The test yielded an F-statistic of 1.0331, with a p-value of 0.6365. The results indicate no significant difference in variances between the two models; any difference observed is likely due to random variation.

Lasso Regression Model Results

Model Mean Squared Error R-squared Adjusted R-squared
Reduced Model 0.2139883 0.7881891 0.7848954
Full Model 0.2082843 0.7975627 0.793371

4.2.5 Polynomial Regression model

Polynomial Regression model on selected features

Polynomial Regression model on all Features

Combination for different variables different degree on selected features

Combination for different variables different degree on all features
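A degree-3 polynomial fit of the kind used here can be sketched with scikit-learn (the report fits `poly()` terms in R). The data are synthetic with a genuinely cubic trend, so the cubic model should clearly outperform a straight line:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=(400, 1))
# a genuinely cubic relationship plus noise
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.2, size=400)

linear = LinearRegression().fit(x, y)
# expand x into [x, x^2, x^3] before the linear fit
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(x, y)
print(round(linear.score(x, y), 3), round(cubic.score(x, y), 3))
```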

Polynomial Regression Model Results

Model Type Degree Mean Squared Error R-squared Adjusted R-squared p-value
Polynomial Reduced Model 3 0.4036827 0.8529 0.8501 2.2e-16
Polynomial Full Model 3 0.4036827 0.8529 0.8501 2.2e-16
Polynomial Reduced Model mixed 0.4112296 0.8471 0.8454 2.2e-16
Polynomial Full Model mixed 0.4112296 0.8471 0.8454 2.2e-16

5. Conclusions

5.1 Regression Model Results

Model Type Mean Squared Error R-squared Adjusted R-squared
Lasso Regression Reduced Model 0.2139883 0.7881891 0.7848954
Lasso Regression Full Model 0.2082843 0.7975627 0.793371
Ridge Regression Reduced Model 0.214421 0.7877608 0.7844604
Ridge Regression Full Model 0.2100862 0.7958114 0.7915834
Simple Linear Model Reduced Model 0.4600969 0.7877 0.7864
Simple Linear Model Full Model 0.4601041 0.789 0.7872
Polynomial D=3. Reduced Model 0.4036827 0.8529 0.8501
Polynomial D=3. Full Model 0.4036827 0.8529 0.8501
Polynomial MIXED Reduced Model 0.4112296 0.8471 0.8454
Polynomial MIXED Full Model 0.4112296 0.8471 0.8454

Based on these metrics, we can see that the Lasso Regression model has a slightly lower MSE and a slightly higher R-squared value compared to the Ridge Regression model. This suggests that the Lasso Regression model performs slightly better in terms of prediction accuracy and explaining the variance in the target variable.

However, it’s important to note that the difference between the models is relatively small. Further analysis, such as cross-validation or hypothesis testing, could provide additional insights into the statistical significance and stability of the model performance.

In addition, the full models (both Lasso and Ridge Regression) tend to perform slightly better than the reduced models: they have lower MSE values and higher R-squared and adjusted R-squared values, indicating better overall performance in prediction accuracy and explained variance. We can still conclude that feature selection provides good results given that it decreases the number of features. However, it is important to consider other factors, such as model complexity and interpretability, when selecting the best model for a particular application.

The polynomial models with degree 3 (both reduced and full) have similar MSE values, indicating a good fit to the data. The R-squared and adjusted R-squared values for these models are relatively high, suggesting that they explain the variance in the data well. The mixed-degree polynomial models also perform well, although slightly worse in terms of MSE than the degree-3 models; their R-squared and adjusted R-squared values are slightly lower but still indicate a reasonable fit. In conclusion, based on these results, the Lasso Regression model on the whole set of features (Full Model) appears to be the better choice among the models considered for predicting the target variable.

Then we can say that the main factors that affect life expectancy are:

  • Status
  • Adult.Mortality
  • percentage.expenditure
  • Hepatitis.B
  • Measles
  • BMI
  • thinness.5.9.years
  • Polio
  • Diphtheria
  • HIV.AIDS
  • Income.composition.of.resources
  • Schooling

We also came to the following conclusions:

  • Education has a significant impact on life expectancy

  • Life expectancy has a negative relationship with alcohol consumption

  • Immunization against Hepatitis B and Diphtheria positively impacts life expectancy

  • Countries with higher income composition of resources for human development have a better life expectancy.

  • There is no significant difference in proportions of the number of infant deaths and the number of under-five deaths.

  • There is no strong correlation between alcohol consumption and life expectancy

  • The most frequent range of life expectancy is 65–82 years; the least frequent ranges are below 45 years and above 85 years.

  • Immunization coverage has a significant impact on life expectancy

  • Population does not have a significant impact on life expectancy